Evaluation of Mutual information versus Gini index for stable feature selection
Abstract
Selecting highly discriminatory features has been crucial to advances in domains such as biomedical sciences, high-energy physics and e-commerce. Evaluating the robustness of feature selection methods to small perturbations in the data, known as feature selection stability, is therefore of great importance to practitioners in these fields. However, little research has investigated the stability of feature selection algorithms independently of any learning model. This project addresses the problem by providing an overview of several established stability measures and reintroducing Pearson’s correlation coefficient as another. The coefficient is then employed in an empirical evaluation of four commonly used feature selection criteria: mutual information maximisation, mutual information feature selection, Gini index and ReliefF. Mutual information maximisation and Gini index show high overall stability for small data samples, with slightly lower stability for ReliefF. All criteria exhibit low stability when applied to high-dimensional datasets consisting of a small number of samples, and mutual information feature selection performs poorly across all datasets.
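As a rough illustration of the idea, the sketch below scores features with a Gini-index criterion on several bootstrap resamples and summarises stability as the mean pairwise Pearson correlation between the resulting score vectors. The abstract does not give the exact protocol, so the resampling scheme, the function names (`gini_index_score`, `bootstrap_scores`, `pearson_stability`) and the synthetic data are all assumptions made for illustration, not the author's implementation.

```python
import numpy as np


def gini_index_score(x, y):
    """Reduction in Gini impurity of labels y when grouped by discrete feature x."""
    def impurity(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    weighted = 0.0
    for value in np.unique(x):
        mask = x == value
        # Weight each group's impurity by the fraction of samples it holds.
        weighted += mask.mean() * impurity(y[mask])
    return impurity(y) - weighted


def bootstrap_scores(X, y, score_fn, n_runs=10, seed=0):
    """Score every feature on n_runs bootstrap resamples (assumed protocol)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    rows = []
    for _ in range(n_runs):
        idx = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
        rows.append([score_fn(X[idx, j], y[idx]) for j in range(n_features)])
    return np.array(rows)


def pearson_stability(scores):
    """Mean pairwise Pearson correlation between rows (runs x features).

    Returns NaN if any run assigns the same score to every feature,
    because the correlation is undefined in that case.
    """
    scores = np.asarray(scores, dtype=float)
    corr = np.corrcoef(scores)  # each row is treated as one variable
    upper = np.triu_indices(scores.shape[0], k=1)
    return float(corr[upper].mean())


if __name__ == "__main__":
    # Hypothetical synthetic data: only feature 0 carries class information.
    rng = np.random.default_rng(42)
    X = rng.integers(0, 3, size=(60, 5))
    y = (X[:, 0] > 0).astype(int)
    scores = bootstrap_scores(X, y, gini_index_score)
    print("stability:", pearson_stability(scores))
```

A value near 1 means the criterion scores features almost identically across resamples, while values near 0 indicate that the scores change substantially under small perturbations of the data. Any other criterion with the same `(x, y)` signature could be passed as `score_fn`.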
Similar resources
Feature Selection for Text Classification Based on Gini Coefficient of Inequality
A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method known as within class popularity to deal with feature selection based on the concept Gini coefficient of inequality (a commonly used measure of inequality of income). The proposed...
Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine
Different approaches have been proposed for feature selection to obtain suitable features subset among all features. These methods search feature space for feature subsets which satisfies some criteria or optimizes several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, features subsets are selected due to some measu...
Statistical Sources of Variable Selection Bias in Classification Tree Algorithms Based on the Gini Index
Evidence for variable selection bias in classification tree algorithms based on the Gini Index is reviewed from the literature and embedded into a broader explanatory scheme: Variable selection bias in classification tree algorithms based on the Gini Index can be caused not only by the statistical effect of multiple comparisons, but also by an increasing estimation bias and variance of the spli...
Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization
The filtering feature-selection algorithm is a kind of important approach to dimensionality reduction in the field of the text categorization. Most of filtering feature-selection algorithms evaluate the significance of a feature for category based on balanced dataset and do not consider the imbalance factor of dataset. In this paper, a new scheme was proposed, which can weaken the adverse effec...
A novel feature selection algorithm for text categorization
With the development of the web, large numbers of documents are available on the Internet. Digital libraries, news sources and inner data of companies surge more and more. Automatic text categorization becomes more and more important for dealing with massive data. However the major problem of text categorization is the high dimensionality of the feature space. At present there are many methods ...